In this section, we will learn how to import data into R and export data from R. We will also discuss the different file types. This section is based on the relevant chapters from two renowned textbooks on the tidyverse.1 These textbooks take different approaches to importing and working with data in RStudio using tidyverse packages. We present the workflows that we find best support reproducibility and ease of understanding.
Before diving into data analysis and working with R, it’s crucial to establish a well-organized workflow. Setting up an R project for each analysis in RStudio is one of the best practices for maintaining this structure. Here’s why it matters:
Organized Workspace: An R project creates a dedicated workspace, keeping all files, scripts, and data for each analysis in one place. This structure makes it easier to locate and manage your resources and helps prevent clutter on your computer.
Consistent File Paths: When working within an R project, file paths become relative to the project’s root directory. This avoids the need for absolute paths (e.g., C:/Users/YourName/ProjectFolder), making your code portable. For example, using relative paths allows you to share your project with others without requiring adjustments to file paths.
Enhanced Reproducibility: With an R project, you can easily recreate your analysis environment. The .Rproj file saves specific project settings, allowing you to return to the project later and pick up where you left off with minimal setup. This is particularly valuable when revisiting analyses or sharing work with collaborators.
To create a new project, open RStudio, go to the File menu, select New Project, and follow the prompts. You’ll see that RStudio sets up a unique working directory, which helps maintain consistency and clarity throughout your analysis.
Or you could try
By following this practice, you set up a solid foundation for a clean, organized, and reproducible workflow in R.
Reading and writing files often involves the use of file paths. A file path is a string of characters that points R and RStudio to the location of a file on your computer.
These file paths can be a complete location (C:/Users/Arun/RIntro_Book.Rmd) or just the file name (RIntro_Book.Rmd). If you pass R a partial file path, R will append it to the end of the file path that leads to your working directory. When you open an RStudio project, the working directory is set to the folder that contains your .Rproj file.
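The way R resolves a partial path can be sketched with base R alone. The file name is the one used above; it does not need to exist for the sketch to run.

```r
# The working directory is where R starts looking for partial paths.
wd <- getwd()

# A partial (relative) path: just the file name.
relative_path <- "RIntro_Book.Rmd"

# The complete path R effectively uses: working directory + partial path.
absolute_path <- file.path(wd, relative_path)

print(wd)
print(absolute_path)

# Both forms point at the same location, so they agree on whether
# the file exists.
print(identical(file.exists(relative_path), file.exists(absolute_path)))
```

If the working directory changes, the same partial path points somewhere else, which is why a fixed project root is helpful.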
When working with files in R, defining paths correctly is essential for accessing your data and saving outputs. The here package is a powerful tool that simplifies file paths, especially within R projects, by automatically locating the root directory of your project.
Why use here?
Simplifies Paths: Instead of typing out long, complex file paths, here constructs paths relative to the root of your project. This makes your code cleaner and easier to read.
Improves Portability: Using here makes your code more portable. When sharing your project with others or switching between computers, the paths generated by here adjust automatically based on the project’s root, so there’s no need to modify paths manually.
Avoids Path Errors: Typing out file paths can lead to errors if you move files around or change directories. The here function helps prevent these issues by always starting paths from the same project root.
Using here in Practice
The here package creates paths by combining the project root directory with any subdirectories or file names you specify. For example:
#install.packages("pacman")
pacman::p_load(here)
here()
[1] "D:/RWorkshops/research_methodology_data_analysis/rmda_book"
When you run here::here() in your R project, it returns the full file path up to the directory where your R project was created. This directory is known as the project root.
If you have a file named nhanes_modified_df.rds stored inside a folder called data within your project, you can easily reference it using the here function. By writing:
here("data", "nhanes_modified_df.rds")
[1] "D:/RWorkshops/research_methodology_data_analysis/rmda_book/data/nhanes_modified_df.rds"
you’re creating a file path that points directly to the nhanes_modified_df.rds file within the data folder, starting from the root of your project. This method keeps things neat, adaptable, and prevents hardcoding of long file paths. Whether you move your project to another computer or share it with someone else, this path will still work without any changes. It’s a simple way to make your workflow more efficient!
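One point worth keeping in mind is that here() only builds the path; it does not check that the file is really there. A minimal sketch, assuming the here package is installed and using the folder and file names from this section:

```r
library(here)

# Build the path from the project root.
data_path <- here("data", "nhanes_modified_df.rds")

# here() returns a character string whether or not the file exists,
# so check explicitly before trying to read it.
if (file.exists(data_path)) {
  message("Found: ", data_path)
} else {
  message("Not found: ", data_path)
}
```

A check like this at the top of a script can give a clearer message than the error raised later by the reading function.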
The RStudio IDE provides an Import Dataset button in the Environment pane, which appears in the top right corner of the IDE by default. You can use this button to import data that is stored in plain text files as well as in Excel, SAS, SPSS, and Stata files.
We recommend using the .csv file type to read and write your data as a best practice. A .csv file is a plain text file in which values are separated by commas, so it can be opened by almost any program, which ensures cross-compatibility.
.rds is a file format native to R for saving R objects, typically in compressed form. .rds files are not text files and are not human readable in their raw form. Each .rds file contains a single object, which makes it easy to assign its contents directly to a single R object. An .RData file, by contrast, can hold several objects and restores them under their saved names, which can overwrite objects already in your workspace; this makes .rds files safer to use.
We can use the read_rds() and write_rds() functions from the readr package to read and write an .rds file. read_rds() loads the file into R, and write_rds() saves an object, such as previously loaded data, as an .rds file. To learn more about the syntax, look at the help page by typing ?write_rds in the Console pane.
For example:
df <- readr::read_rds(here("data", "nhanes_modified_df.rds"))
In the above line of code, we are instructing R to:
Look inside the project folder: here::here("data", "nhanes_modified_df.rds") tells R to look in the data folder within your project for a file named nhanes_modified_df.rds.
Read the .rds file: readr::read_rds() is used to load this .rds file into the object df.
However, if:
there is a spelling mistake in either the folder name (data) or the file name (nhanes_modified_df.rds), or
the file doesn’t exist at the specified location,
R will not be able to find the file, and you’ll encounter an error message, typically saying that the file cannot be found or the connection cannot be opened.
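The write-then-read cycle and the missing-file error can be sketched together. This assumes the readr package is installed; the scores data frame and the file names are hypothetical, and a temporary folder is used so that nothing in your project is touched.

```r
library(readr)

# Hypothetical example data; any single R object can go into an .rds file.
scores <- data.frame(id = 1:3, score = c(90.5, 82.0, 77.25))

# Write to a temporary file.
rds_path <- tempfile(fileext = ".rds")
write_rds(scores, rds_path)

# Reading the file back returns the stored object.
scores_again <- read_rds(rds_path)
print(all.equal(scores, scores_again))

# A path that does not exist raises an error; tryCatch() turns the
# error into a message and returns NULL instead of stopping the script.
missing_path <- file.path(tempdir(), "no_such_file.rds")
result <- tryCatch(
  read_rds(missing_path),
  error = function(e) {
    message("Could not read file: ", conditionMessage(e))
    NULL
  }
)
```

Note that write_rds() does not compress by default; passing compress = "gz" produces a smaller file.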
Similarly, you can use the write_csv() function from the readr package to write a .csv file.
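As a rough illustration of the .csv workflow, the sketch below writes a small data frame and reads it back. It assumes readr version 2.0 or later (for the show_col_types argument); the patients data and the temporary file are hypothetical.

```r
library(readr)

# Hypothetical example data.
patients <- data.frame(
  id = c(1, 2, 3),
  group = c("control", "treated", "treated")
)

# Write to a temporary .csv file.
csv_path <- tempfile(fileext = ".csv")
write_csv(patients, csv_path)

# A .csv is plain text, so its raw lines are human readable.
print(readLines(csv_path))

# read_csv() returns a tibble; show_col_types = FALSE silences the
# column specification message.
patients_again <- read_csv(csv_path, show_col_types = FALSE)
print(patients_again)
```

In a project, you would replace csv_path with a path built by here(), for example a file inside the data folder.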
try!!!
Note
There are different packages to import different types of data.
haven: SPSS, Stata, or SAS
readxl: Excel spreadsheets
readr: csv, txt, tsv, etc.
When working with data in R, you’ll frequently encounter two common types of data structures: tibbles and data.frames. While both are used to store tabular data, they have some important differences that affect how they behave and how you interact with them. Understanding these differences can help streamline your data analysis and avoid potential pitfalls.
To learn more in-depth about tibbles, you can run vignette("tibble") in your R console, which provides a comprehensive overview.
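Some major differences are:
Before R 4.0.0, data.frame() converted strings to factors by default; a tibble never does.
data.frame() replaces spaces in column names with dots and adds "X" before column names that start with a number; a tibble keeps column names as given.
A tibble does not support row names, so row.names() is not useful for a tibble.
A tibble prints only the first ten rows and the columns that fit on one screen.
Tidy data is a way to describe data that’s organized with a particular structure – a rectangular structure, where each variable has its own column, and each observation has its own row. - Hadley Wickham, 2014
Tidy data follows three rules: each variable has its own column, each observation has its own row, and each value has its own cell. These three rules are interrelated because it’s impossible to satisfy only two of the three.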
Tidy datasets are all alike, but every messy dataset is messy in its own way. - Hadley Wickham
Working with messy data can be messy! You need to build custom tools from scratch each time you work with a new dataset.
Illustrations from : https://github.com/allisonhorst/stats-illustrations
Packages like tidyr and dplyr let you get on with analysing your data and answering key questions, rather than spending your time cleaning the data.
Note
Tidy data allows you to be more efficient by using specialised tools built for the tidy workflow. There are a lot of tools specifically built to wrangle untidy data into tidy data.
Another advantage of working with tidy data is that it makes collaboration easier, as your colleagues can use the same familiar tools rather than being overwhelmed by work you did from scratch. It also helps your future self, because a consistent workflow takes less time to adjust when you make incremental changes.
Tidy data also makes analyses easier to reproduce, because tidy datasets are easier to understand, update, and reuse. By combining tools that all expect tidy data as input, you can build and iterate on really powerful workflows.
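To make the idea concrete, here is a minimal sketch of tidying with tidyr::pivot_longer(). It assumes the tidyr package is installed; the cases_wide table and its values are hypothetical. In the untidy version, the variable year is hidden in the column names.

```r
library(tidyr)

# Hypothetical untidy table: one column per year.
# check.names = FALSE keeps the column names "2019" and "2020" as given.
cases_wide <- data.frame(
  country = c("A", "B"),
  `2019` = c(10, 20),
  `2020` = c(15, 25),
  check.names = FALSE
)

# Move the year columns into two variables: year and cases.
# Each row is now one observation (one country in one year).
cases_tidy <- pivot_longer(
  cases_wide,
  cols = c("2019", "2020"),
  names_to = "year",
  values_to = "cases"
)
print(cases_tidy)
```

After reshaping, every variable (country, year, cases) has its own column, which is the form that dplyr and other tidyverse tools expect.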
When you load data into R through the RStudio GUI using tidyverse packages, the data is automatically stored as a tibble. A tibble is a data frame with some additional features and properties that make our lives easier. It is the main workhorse of the tidyverse.
You can change data.frame objects to a tibble using the as_tibble() function.
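The column-name difference and the conversion can be seen in a short sketch. It assumes the tibble package is installed; the column names and values are hypothetical.

```r
library(tibble)

# data.frame() rewrites column names that are not syntactically valid.
df <- data.frame(`body mass` = c(60, 72), `2020` = c(1, 2))
print(names(df))

# tibble() keeps the names exactly as given.
tb <- tibble(`body mass` = c(60, 72), `2020` = c(1, 2))
print(names(tb))

# as_tibble() converts an existing data.frame to a tibble.
# The result is still a data frame, with extra tibble classes added.
df_as_tb <- as_tibble(df)
print(class(df_as_tb))
```

Because a tibble is still a data frame, functions that expect a data.frame generally continue to work after the conversion.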
Now that you have imported data into RStudio, it’s good practice to have a look at the data. There are many ways you can do this within RStudio.
View() function
Some other things you can do to have a look at your data are:
Checking the class of the dataset using the class() function
Checking the structure of the dataset using the str() function
Note
class() and str() are not limited to datasets; they can be used on any R object.
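A quick sketch of class() and str() applied to three different kinds of object, using only base R and the built-in mtcars dataset. The numbers vector is a hypothetical example.

```r
# A numeric vector.
numbers <- c(1.5, 2.5, 3.5)
print(class(numbers))
str(numbers)

# A function is an R object too.
print(class(mean))
str(mean)

# A built-in dataset; only the first three columns are shown to keep
# the output short.
print(class(mtcars))
str(mtcars[, 1:3])
```

For a data frame, str() lists each column with its type and first few values, which makes it a compact first check after importing data.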
Some additional tips for quickly looking at your data:
head()
tail()
glimpse()
Type the name of the dataset in the console and see what happens.
How many rows and columns can you visualize?
Now, try the head(), tail(), and glimpse() functions
Try to create a tibble manually in RStudio with a numeric, character, and factor variable. (Hint: vignette("tibble"))